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Abstract 

o 

^\| . Nested sampling is a powerful approach to Bayesian inference ultimately limited by 

the computationally demanding task of sampling from a heavily constrained probabil- 
j^'. ity distribution. An effective algorithm in its own right, Hamiltonian Monte Carlo is 

readily adapted to efficiently sample from any smooth, constrained distribution. Uti- 
lizing this constrained Hamiltonian Monte Carlo, I introduce a general implementation 

CN ' of the nested sampling algorithm. 

c^ ; 1 Bayesian Inference 

-(— > 

i-^ ' Theorem, 

•s: pH^,H) = -^^^^ , 

(-] ' The prior. 



o 
o 



X 



Bayesian inferenceis a diverse and robust analysis methodology [H [2] based on Bayes' 



p {a\H) = IT (a) , 

^ ' encodes all knowledge about the parameters a before the data P have been collected, while 

r^ . the likelihood, 

in 



piV\a,H) = Cia), 
defines the probabilistic model of how the data are generated. The evidence, 



p {V\H) = I iTaC (a) vr (a) = Z, 



ensures proper normalization while allowing for auxiliary applications such as model com- 
;_, ■ parison. Lastly, the posterior, 

C^ , 

p{a\V,H) =p{a), 

is the refinement of the prior vr (a) given the information inferred from D. All model 
assumptions are captured by the conditioning hypothesis H. 
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While Bayes' Theorem is simple enough to formulate, in practice the individual compo- 
nents are often sufficiently complex that analytic manipulation is not feasible and one must 
resort to approximation. One of the more successful approximation techniques, Markov 
Chain Monte Carlo (MCMC) produces samples directly from the posterior distribution 
that are often sufficient to characterize even high dimensional distributions. The one man- 
ifest limitation of MCMC, however, is the inability to directly calculate the evidence Z, 
which, as MacKay notes, "is often the single most important number in the problem" [1]. 

Nested sampling [3] is an alternative to sampling from the posterior that instead em- 
phasizes the calculation of the evidence. 

2 Nested Sampling 

Consider the support of the likelihood above a given bound L (Fig[l^, [Ijj), 

Q = {a\C (a) > L}, 
and the associated prior mass across that support (Fig [it). 



x{L)= / d™a7r(a). 

J a 

The differential dx gives the prior mass associated with the likelihood L = C {a) (Fig 

mi), 



dx(L) = d fd"'a7r{a) 

J a 

dx{L) = I d'^a7r(a) 

J da 



I da 

where da is the m — \ dimensional boundary of constant likelihood, 

da = {a\C (a) = L). 

Introducing the coordinate a±_ perpendicular to the likelihood constraint boundary and 
the m — 1 coordinates a\\ parallel to the constraint (Fig [2]), the integral over da simply 
marginalizes an and the differential becomes 



dx(L) = / daxd'^-^a,, vr (a) 

'da 



{L)= I 

J da 

dx (L) = dax / dr-^a\\ vr (a) 

Jda 

dx (L) = da_|_7r {a±) . 
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Figure 1: (a) The prior distribution and likelihood, (b) d, the set of a satisfying the 
likelihood bound £ (a) > L, (c) the prior mass x{L), and (d) the differential prior mass 
dx. 
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Figure 2: Cartoon of aj_ and an in two dimensions, aii parameterizes the contour of 
constant likelihood L while q;j_ parameterizes translations orthogonal to the contour. 

Returning to the evidence, 



Z 
Z 



d^a £ (a) vr (a) 
/"da^d™-^a||£(a)^(a). 



By construction the likelihood is invariant to changes in an, L{a) 
integral simplifies to 



£ (aj.), and the 



/"da^d™-ia||£(ai)7r(Q) 
f da^C{a^) /"d™-ia||7r(a) 
/ da±C{a±)-iT {a±) 
/ da±-ir {a±) C {a±) 
dx L (x) 
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Z 

z 
z 
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where L (x) = C {a± (x)) is the likelihood bound resulting in the prior mass x. 

This clever change of variables has reduced the m dimensional integration over the 
parameters q to a one dimensional integral over the bounded support of x. Although this 
simplified integral is easier to calculate in theory, it is fundamentally limited by the need 
to compute L (x). 



Numerical integration, however, needs only a set of points {xk,Lk) and not L (x) ex- 
plicitly. Sidestepping L{x), consider instead the problem of generating the set {x^^^Lj.) 
directly. 

2.1 Sampling L{x) With The Shrinkage Distribution 

In particular, consider a stochastic approach beginning with n samples drawn from tt (a). 
The sample with the smallest likelihood, >Cmin) bounds the largest x but otherwise nothing 
can be said of the exact value, a^max) without an explicit, and painful, calculation from the 
original definition. 

The cumulative probability of Xmax) however, is simply the probability of Xmax exceeding 
the X of each sample. 



-^ va^maxj 
-^ va^maxj 

-^ v^^maxj 



^ y^l _; a^inaxj ' ' ' r [Xn ^ a^rnaxj 

dxTT {x) • • • / dxvr {x) 

JQ 

dxvr {x) J , 
/ 



where tt (x) is uniformly distributed. 



TT {x 
TT {x 
TT {x 
TT {x 
TT {x 
TT {x 



jm— 1 



da 



ail TT (q (x)) 



da 
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d'"-^a||7r(a(x)) 



dx 
da_L 



dx 



1 



d™-ia||^(a(x)) 
95 vr(Q_L(x)) 

' jm— 1 



VT (a_L (x)) 7a<i 

7r(a_L (xj) 

1, 0<x< 1 
0, otherwise 



d™-^a||7r(a(x)) 



Simplifying, the cumulative probability of the largest sample reduces to 



/ dxTT [x] 

/ dx 



with the corresponding probability distribution 

dP (Xr 



dx 

P V'^maxj — ^X 



max 
n—1 
max' 



Estimating Xmax from the probability distribution p (xmax) immediately yields a pair 

{■^1 •^rnax) ^1 •'-inin/ • 

A second pair follows by drawing from the constrained prior 

I 0, otherwise ' 

or in terms of x, 

1/xi, < X < xi 



0, otherwise 
n samples from this constrained prior yield a new minimum L2 with X2 distributed as 

(\ n— 1 
— 
Xl 

Making another point estimate gives (x2,-^2)- 

Generalizing, the n samples at each iteration are drawn from a uniform prior restricted 
by the previous iteration, 

\ 0, otherwise ' 

The distribution of the largest sample, Xk, follows as before, 

(\ n— 1 

Xk-l 



Note that this implies that the shrinkage at each iteration, tk = Xk/xk-i, is identicahy 
and independently distributed as 

P{tk)=p{t) = ne^~\ 

Moreover, a point estimate for Xk can be written entirely in terms of point estimates for 
the tk, 

Xk Xk-l X2 Xi 

Xk = • • • xo 

Xk-l Xk-2 Xi Xq 

Xk = tk ■ tfe-1 . . .t2 ■ ti ■ Xq 



Xk= I n*M ^0- 



More appropriate to the large dynamic ranges encountered in many applications, log Xk 
becomes 



logXfc = log {'[{ti j 



Xq 



log Xk = ^ log ti + log Xq. 
i=l 

Performing a quick change of variables, the logarithmic shrinkage will be distributed as 



p(logt) = ne"'°st 
with the mean and standard deviation 

1 1 

logt = ±-. 

n n 

Taking the mean as the point estimate for each log ti finally gives 

k 



with the resulting error 



log Xk - log Xq 

n 



(5(logXfc -logxo) = . 

n 



Parameterizing Xk in terms of the shrinkage proves immediately advantageous - because 
the logtj are independent, the errors in the point estimates tend to cancel and the estimate 
for the Xk grow increasingly more accurate with k. 
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Figure 3: (a) The evidence integrand L{x). Note how the bulk resides at exponentially 
small values of x. (b) Taking the mean of the shrinkage distribution, nested sampling 
generates a series of samples (xfc,Lfc) with logXfc_i — logx^ = 1/n. 

At each iteration, then, a pair (xfc,Lfc) is given by the point estimate for x^ and the 
smallest likelihood of the n drawn samples. 



2.2 The Algorithm 

A proper implementation of nested sampling begins with the initial point (xq 
At each iteration, n samples are drawn from the constrained prior 



l,Ln 



0). 



TT (q) OC 



7r(a), 
0, 



C{a) > Ck-i 
otherwise 



and the sample with the smallest likelihood provides a "nested" sample with Lk = C (ak) 
and logXk = —n (Figure [3|). £ (afc) defines a new constrained prior for the following 
iteration. Note that the remaining samples from the given iteration will already satisfy 
this new likelihood constraint and qualify as n — 1 of the samples necessary for the next 
iteration - only one new sample will actually need to be generated. 

As the algorithm iterates, regions of higher likelihood are reached until the nested sam- 
ples begin to converge to the maximum likelihood. Determining this convergence is tricky, 
but heuristics have been developed that are quite successful for well behaved likelihoods 

13 S]. 

Once the iterations have terminated, the evidence is numerically integrated using the 
nested samples. The simplest approach is a first order numerical quadrature: 



k 






e " — e " LL^ 



e " ( 1 — e " I Lfc. 



Errors from the numerical integration are dominated by the errors from the use of point 
estimates and, consequently, higher order quadrature offers little improvement beyond the 
first order approximation. 

The errors inherent in the point estimates can be reduced by instead marginalizing 
over the shrinkage distributions. Note, however, that in many applications the likelihood 
will be relatively peaked and most of the prior mass will lie within its tails, x (L) will 
then be heavily weighted towards exponentially small values of L where the likelihood 
constraint falls below the tails and the prior mass rapidly accumulates. Likewise, the 
integrand L (x) will be heavily weighted towards exponentially small values of x and the 
dominant contributions from the quadrature will come from later iterations, exactly where 
the point estimates become more precise. The resulting error in the integration tends to 
be reasonable, and the added complexity of marginalization offers little improvement. 

The choice of n can also be helpful in improving the accuracy of the integration. For 
larger n the shrinkage distribution narrows and the estimates for the Xk become increas- 
ingly better. Multiple samples at each iteration also prove valuable when the likelihood is 
multimodal, as the individual samples allow the modes to be sampled simultaneously [Ij. 

Lastly, if the a yielding the smallest likelihood are stored with each nested sample then 
posterior expectations can be estimated with the quadrature weights, 

•^ k 

The remaining obstacle to a fully realized algorithm is the matter of sampling from the 
prior given the likelihood constraint C > >Cmin- Sampling from constrained distributions 
is a notoriously difficult problem, and recent applications of nested sampling have focused 
on modifying the algorithm in order to make the constrained sampling feasible [H E]. 
Hamiltonian Monte Carlo, however, offers samples directly from the constrained prior and 
provides an immediate implementation of nested sampling. 



3 Hamiltonian Monte Carlo 

Hamiltonian Monte Carlo [H El [5] is an efficient method for generating samples from the 
771 dimensional probability distribution 

p (x) oc exp [—E (x)] . 
First, consider instead the larger distribution 

p(x,p) =p(x)p(p) 
where the latent variables p are i.i.d. standardized Gaussians 

p(p) ocexp I "2 IpI 
The joint distribution of the initial x and the latent p is then 

P (x, p) oc exp i -- IpI^ - E (x) 
p(x,p) ocexp(-F) 
where 

takes the form of the Hamiltonian of classical mechanics. 
Applying Hamilton's equations 

dx _dH _ 
dt dp 

dp dH „ „ , X 

dt dxL ^ ^ 

to a given sample {x, p} produces a new sample {x',p'}. Note that the properties of 
Hamiltonian dynamics, in particular Liouville's Theorem and conservation of H, guarantee 
that differential probability masses from p (x, p) are conserved by the mapping. As a 
result, this dynamic evolution serves as a transition matrix T (x, p; x', p') with the invariant 
distribution p(x, p). Moreover, the time reversal symmetry of the equations ensures that 
the evolution satisfies detailed balance: 

T (x, p; x', p') = T (x', p'; x, p) . 
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Because H is conserved, however, the transitions are not ergodic and the samples do not 
span the full support of p (x, p). Ergodicity is introduced by adding a Gibbs sampling step 
for the p. Because the x and p are independent, sampling from the conditional distribution 
for p is particularly easy 



p(p|x) =p{p) 

m 

l^(p|x) = J]AA(0,l). 
«=i 

The algorithm proceeds by alternating between dynamical evolution and Gibbs sam- 
pling and the resulting samples {xk,Pk} form a proper Markov chain. 

In practice the necessary integration of Hamilton's equations cannot be performed 
analytically and one must resort to numerical approximations. Unfortunately, any discrete 
approximation will lack the symmetry necessary for both Liouville's Theorem and energy 
conservation to hold, and the exact invariant distribution will no longer be p (x, p). This can 
be overcome by treating the evolved sample as a Metropolis proposal, accepting proposed 
samples with probability 



/ p(x',p' 
P (accept) = min I 1 



P (accept) = min I 1 



p(x,p) 
exp {-H') 
ex.p{-H) 
P (accept) = min (1, exp {—AH)) . 

All transitions to a smaller Hamiltonian, and hence higher probability, are automatically 
accepted. Transitions resulting in a larger Hamiltonian are only occasionally accepted. 

Further implementation details, particularly insight on the choice of step size and total 
number of steps, can be found in [^. 

3.1 Constrained Hamiltonian Monte Carlo 

Now consider the constrained distribution 

p(x), C(x)>0 



^(")"^ 0, else 

Sampling from p (x) is challenging. The simplest approach is to sample from p (x) 
and discard those not satisfying the constraint. For most nontrivial constraints, however, 
this approach is extremely inefficient as the majority of the computational effort is spent 
generating samples that will be immediately discarded. 
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C(x) > C(x) = C(x) < 



C(x) > Cfx) = C(x) < 



C(x) > C(x) = C(x) < 






(a) 



(b) 



(c) 



Figure 4: Cartoon of a particle bouncing off the constraint boundary C{x) = 0. (a) At 
step i + 2 the particle violates the constraint, at which point (b) the normal at Xi^2 is 
computed and the momenta reflected in lieu of the normal leapfrog update, (c) The next 
spatial update is no longer in violation of the constraint. 

From the Hamiltonian point of view, the constraint becomes an infinite potential barrier 



^(x) 



E{x), C (x) > 
oo, else 



Incorporating infinite barriers directly into Hamilton's equations is problematic, but 
physical intuition provides an alternative approach. Particles incident on an infinite barrier 
bounce, the momenta perpendicular to the barrier perfectly reflecting: 



p =Y>T-VN 

p' = p — 2 (p • n) fi. 

Instead of dealing with infinite gradients, then, one can replace the momenta updates with 
reflections when the equations integrate beyond the support of p(x). 

Discrete updates proceed as follows. After each spatial update the constraint is checked 
and if violated then the normal n is computed at the new point and the ensuing momentum 
update is replaced by reflection (Algo [U FigH]). Note that the spatial update cannot be 
reversed, nor can an interpolation to the constraint boundary be made, without spoiling 
the time-reversal symmetry of the evolution. 

For smooth constraints C (x) > the normal is given immediately by 



n 



VC(x) 
|VC(x)| 



The normal for many discontinuous constraints, which are particularly useful for sam- 
pling distributions with limited support without resorting to computationally expensive 
exponential reparameterizations, can be determined by the geometry of the problem. 
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Finally, if the evolution ends in the middle of a bounce, with the proposed sample laying 
just outside of the support of p (x), it is immediately rejected as the acceptance probability 
is zero, 

P (accept) = exp {—AH) = exp (—00) = 0. 

Given a seed satisfying the constraint, the resultant Markov chain bounces around p (x) 
and avoids the inadmissible regions almost entirely. Computational resources are spent on 
the generation of relevant samples and the sampling proceeds efficiently no matter the scale 
of the constraint. 

Algorithm 1 Dynamic evolution with a first order leapfrog discretization of Hamilton's 
equations and constraint C(x) > 0. 

{First momentum half step} 

p ^ p - ^eVE (x) 

for i = to T do 

{Full spatial step} 
X -^ X + ep 

{Check for constraint} 
if C(x) > then 

{Full momentum step} 

p ^ p - eVE (x) 
else 

{Bounce} 

h^ VC(x)/|VC(x)| 

p <— p — 2 (p h) h 
end if 

end for 

{Full spatial step} 
X ^ X + ep 

{Last momentum half step} 
p ^ p - leVE (x) 
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3.2 Application to Nested Sampling 

Constrained Hamiltonian Monte Carlo (CHMC) naturally complements nested sampling 
by taking 



p (xj — )• vr [a) 
C{x)^C{a)-L. 

The CHMC samples are then exactly the samples from the constrained prior necessary for 
the generation of the nested samples. A careful extension of the constraint also allows for 
the addition of a limited support constraint, making efficient nested sampling with, for 
example, gamma and beta priors immediately realizable. 

Initially, the n independent samples are generated from n Markov chains seeded at 
random across the full support of vr (a). After each iteration of the algorithm, the Markov 
chain generating the nested sample is discarded and a new chain is seeded with one of 
the remaining chains. Note that this new seed is guaranteed to satisfy the likelihood 
constraint and the resultant CHMC will have no problems bouncing around the constrained 
distribution to produce the new sample needed for the following iteration. 

A suite of C++ classes implementing nested sampling with CHMC has been developed 
and is available for general usecl 

3.3 Conclusions 

Constrained Hamiltonian Monte Carlo is a natural addition to nested sampling, the com- 
bined implementation allowing efficient and powerful inference for any problem with a 
smooth likelihood. 
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